iT邦幫忙

2022 iThome 鐵人賽

DAY 11
0
自我挑戰組

用Python學習網路爬蟲30天系列 第 11

[Day11] 走訪HTML網頁

  • 分享至 

  • xImage
  •  

走訪HTML網頁取得資料

我們除了可以使用Beautiful Soup中的特定屬性來幫助走訪網頁,也可以使用物件樹或上一個/下一個元素來走訪剖析HTML網頁的標籤順序。

  • Python物件樹走訪:
    Python物件樹使以類似階層上下樓梯的方式走訪標籤。下方以FJU_p4.html左半邊內容< section class=”left” >的html格式內容為例:
    https://ithelp.ithome.com.tw/upload/images/20220925/20152180HnKuZvPlc2.png

    1. 向下走訪: 從 < section.left > -> < table> -> < tr > -> < td >

    2. 向上走訪: 從 < td > -> < tr > -> < table > -> < section.left >

    3. 兄弟走訪: 即同一層的走訪。從 < p > -> < br > 和 < br > -> < p >

  • 上一個和下一個元素走訪:
    https://ithelp.ithome.com.tw/upload/images/20220925/20152180emgnaBwOoo.png

    1. 下一個元素走訪: 目前標籤物件立即的下一個物件。
      例. < section.left > -> < p > -> < br > -> < p > -> …
    2. 上一個元素走訪: 目前標籤物件立即的上一個物件
      例. < tr > -> < tbody > -> < table > -> < br > -> …

實作練習

以FJU_p4.html中的左半邊內容< section class=”left” >為例擷取指定的文字或標籤內容。

  1. 向下走訪
    (1) 使用子標籤名稱

    from bs4 import BeautifulSoup 
    
    with open("FJU_p4.html", "r", encoding="utf8") as fp:
        soup = BeautifulSoup(fp, "lxml")
    
    # 使用屬性向下走訪
    print(soup.html.head.title.string)
    print(soup.html.head.meta["charset"])
    
    # 使用table屬性取得第1個<td>標籤
    print(soup.html.body.section.table.tr.td.string)
    

    https://ithelp.ithome.com.tw/upload/images/20220925/20152180NTpDk1KeXu.png

    (2) 使用children屬性取得所有子標籤

    from bs4 import BeautifulSoup
    from bs4.element import NavigableString 
    
    with open("FJU_p4.html", "r", encoding="utf8") as fp:
        soup = BeautifulSoup(fp, "lxml")
    
    # 使用屬性取得所有子標籤
    tag_div = soup.select("table") # 找到<table>標籤
    tag_tr = tag_div[0].tr      # 走訪到之下的<tr>           
    for child in tag_tr.children:
    if not isinstance(child, NavigableString):
        print(child.name)
        for tag in child:
            if not isinstance(tag, NavigableString):
                print(tag.name, tag.string)
            else:
                print(tag.replace('\n', ''))
    

    https://ithelp.ithome.com.tw/upload/images/20220925/20152180QNgirPRV6h.png

  2. 向上走訪
    (1) 走訪父標籤

    from bs4 import BeautifulSoup 
    
    with open("FJU_p4.html", "r", encoding="utf8") as fp:
        soup = BeautifulSoup(fp, "lxml")
    
    tag_table = soup.select("table") # 找到<table>
    tag_td = tag_table[0].tr       # 走訪到之下的<td>
    
    # 使用屬性取得父標籤
    print(tag_td.parent.name)
    

    https://ithelp.ithome.com.tw/upload/images/20220925/20152180QOvLLP19yd.png

    (2) 走放祖父標籤

    from bs4 import BeautifulSoup
    
    with open("FJU_p4.html", "r", encoding="utf8") as fp:
        soup = BeautifulSoup(fp, "lxml")
    
    tag_table = soup.select("table") # 找到<table>標籤
    tag_td = tag_table[0].tr       # 走訪到之下的<td>
    
    # 使用函數取得所有祖先標籤
    for tag in tag_td.find_parents():
        print(tag.name)
    
    

    https://ithelp.ithome.com.tw/upload/images/20220925/2015218052MMY35lnq.png

  3. 兄弟走訪
    (1) 走訪下一個兄弟標籤使用next_sibling屬性和find_next_sibling()函數

    from bs4 import BeautifulSoup 
    
    with open("FJU_p4.html", "r", encoding="utf8") as fp:
        soup = BeautifulSoup(fp, "lxml")
    
    tag_section = soup.select(".left") # 找到class=left的標籤
    first_td = tag_section[0].table.tr  # 第1個<td>
    print(first_td)
    
    # 使用next_sibling屬性取得下一個兄弟標籤
    second_td = first_td.next_sibling.next_sibling
    print(second_td)
    
    # 呼叫next_sibling()函數取得下一個兄弟標籤
    third_td = second_td.find_next_sibling()
    print(third_td)
    print("---------------------------------------")
    
    # 呼叫next_siblings()函數取得所有兄弟標籤
    for tag in first_td.find_next_siblings():
        print(tag.name, tag.td.string)
    

    https://ithelp.ithome.com.tw/upload/images/20220925/201521801uz90f9Jxj.png

    (2) 走訪上一個標籤使用previous_sibling屬性和find_previous_sibling()函數

    from bs4 import BeautifulSoup 
    
    with open("FJU_p4.html", "r", encoding="utf8") as fp:
        soup = BeautifulSoup(fp, "lxml")
    
    tag_section = soup.select(".left") # 找到class=left的標籤
    tag_td = tag_section[0].table.tr # 第1個<td>
    fifth_td =  
    tag_td.next_sibling.next_sibling.next_sibling.next_sibling.next_sibling.next_sibling
    
    # 使用previous_sibling屬性取得前一個兄弟標籤
    fourth_td = fifth_td.previous_sibling.previous_sibling
    print(fourth_td )
    
    # 呼叫previous_sibling()函數取得前一個兄弟標籤
    third_td = fourth_td.find_previous_sibling()
    print(third_td)
    print("---------------------------------------")
    # 呼叫previous_siblings()函數取得所有兄弟標籤
    for tag in fifth_td.find_previous_siblings():
        print(tag.name, tag.td.string)
    

    https://ithelp.ithome.com.tw/upload/images/20220925/20152180u2eN0IuVTZ.png

  4. 下一個元素走訪:使用next_element屬性

    from bs4 import BeautifulSoup 
    
    with open("FJU_p4.html", "r", encoding="utf8") as fp:
        soup = BeautifulSoup(fp, "lxml")
    
    tag_section = soup.section # 找到第<section>標籤
    print(type(tag_section), tag_section.name)
    tag_next = tag_section.next_element.next_element # 找到第<section>標籤的下一個元素
    print(type(tag_next), tag_next.name)
    
    

    https://ithelp.ithome.com.tw/upload/images/20220925/201521808GzSNpGTjA.png

  5. 上一個元素走訪:使用previous_element屬性

    from bs4 import BeautifulSoup 
    
    with open("FJU_p4.html", "r", encoding="utf8") as fp:
        soup = BeautifulSoup(fp, "lxml")
    
    tag_tr = soup.tr # 找到第<tr>標籤
    print(type(tag_tr), tag_tr.name)
    tag_previous = tag_tr.previous_element.previous_element # 找到第<tr>標籤的上一個元素
    print(type(tag_previous), tag_previous.name)
    

    https://ithelp.ithome.com.tw/upload/images/20220925/20152180xMohCSUVLF.png


上一篇
[Day10] 擷取靜態HTML網頁資料5_XPath表達式
下一篇
[Day12] 資料儲存成檔案
系列文
用Python學習網路爬蟲30天30
圖片
  直播研討會
圖片
{{ item.channelVendor }} {{ item.webinarstarted }} |
{{ formatDate(item.duration) }}
直播中

尚未有邦友留言

立即登入留言